Analyzing Loan Data from Prosper by Rajendra B Pal

Introduction

This report explores the Prosper Loan Dataset. Prosper is an online marketplace lending(also called peer-to-peer lending) that connects borrowers and investors. Investors can explore credit-worthy borrowers by FICO score, ratings, and terms and then invest in unsecured loans( or a part of the loan) and earn returns.

Why Analyse Prosper Dataset

I am a financial market enthusiast and follow it daily basis. I believe the financial markets are the reflection of human psychology. The year 2008 financial crisis was due to subprime mortgage loans, derivatives, too much leverage, and housing bubble that caused a global crisis and lead to the demise of financial firms, required trillion-dollar taxpayer bailouts, and caused a recession that matched the Great Depression in its magnitude.

Prosper loan is a peer-to-peer lending/loan marketplace in the United States. Prosper loan dataset contains records from year 2006 to 2014. I am trying to understand why debt delinquency happens and what can be done to reduce it. Is borrower financial situation(features like AmountDelinquent, LoanCurrentDaysDelinquent, Occupation, EmploymentStatus, CreditScoreRangeLower, CreditScoreRangeUpper etc. ) or lenders business model (features like ProsperRating..Alpha. , IncomeVerifiable, BorrowerAPR etc ) has role to play.

# Univariate Plots Section

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

From the summary details, the average loan is $6,500. Majority of the loans are less than 10,000 as shown in the positively skewed plot thus mean($8,337) more than the median($6,500).

Next obvious question to ask what kind of services borrowers are buying from loan money.

Most of the borrowers are using the loan for debt consolidation which could be credit card debt and home improvement.

Next looking into employment status of borrower.

Most of loan takers are “Employed”, “Full-time” and “self-employed” - which suggest that there is high probabity that loan will be paid in full.

Next looking into Income Range of borrower.

Most of the borrower’s income is in the range of $ 25K - $ 75K.

Next looking into annual percentage rate, borrower’s APR(interest rate plus other costs ) for the loan.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00653 0.15629 0.20976 0.21883 0.28381 0.51229      25

Borrower’s APR has a mean of 0.2188 or 21.88% more than the median (0.20976 or 20.97%). We also see loan APR more 30%, which is pretty high and tip borrower to delinquency.

Next we look into Loan status .

Frequencies of the loan that are “completed” and “current” are more. It would be intersting to see “Chargedoff” and “Defaulted” loan more closely in terms of dollar(s) amount and reasons behind it - may be features like “Employmentstatus”, “DebtToIncomeRatio”,“LoanCurrentDaysDelinquent” may give us some clue.

Next, it would be interesting to look at the borrower’s debt to income ratio. Higher debt to income ratio is a recipe for delinquency.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

Average borrower has a debt to income ratio of 0.27 or 27% less than the typical 33%.

Next, looking into Credit score.

As credit score appear to be normally distributed - Proposer’s overall risk is reasonable.

Plotting feature “LoanOriginationQuarter” could be impressive as US Housing buble burst resulting credit crisis and subsequent 2007-2009 recession. The market bottomed by June 2009.

Above frequency plot does show a sharp dip in loan origination starting “Q4 2008” and then slowly gaining momentum from “Q3 2009” onward.

Plotting Prosper ratings. Prosper Rating is a proprietary rating system that allows potential investors evaluation of loan applicants.

A large count of loan does not have Prosper Rating assigned to them. This may be because the proprietary rating system developed after the 2008-2009 credit crisis.

Let’s see the distribution of amount delinquent at the time credit profile pulled.

Above plot show that most of the borrowers have less than $1000 delinquent at the time credit was pulled.

Univariate Analysis

Looking into the summary of some the variables like “LoanOriginalAmount”,“MonthlyLoanPayment”,

##  LoanOriginalAmount MonthlyLoanPayment  BorrowerAPR     
##  Min.   : 1000      Min.   :   0.0     Min.   :0.00653  
##  1st Qu.: 4000      1st Qu.: 131.6     1st Qu.:0.15629  
##  Median : 6500      Median : 217.7     Median :0.20976  
##  Mean   : 8337      Mean   : 272.5     Mean   :0.21883  
##  3rd Qu.:12000      3rd Qu.: 371.6     3rd Qu.:0.28381  
##  Max.   :35000      Max.   :2251.5     Max.   :0.51229  
##                                        NA's   :25       
##  StatedMonthlyIncome MonthlyLoanPayment.1   Investors      
##  Min.   :      0     Min.   :   0.0       Min.   :   1.00  
##  1st Qu.:   3200     1st Qu.: 131.6       1st Qu.:   2.00  
##  Median :   4667     Median : 217.7       Median :  44.00  
##  Mean   :   5608     Mean   : 272.5       Mean   :  80.48  
##  3rd Qu.:   6825     3rd Qu.: 371.6       3rd Qu.: 115.00  
##  Max.   :1750003     Max.   :2251.5       Max.   :1189.00  
## 

What is the structure of your dataset?

The structure of Prosper Dataset has numeric and factor variables. Prosper Dataset has total 113937 observations of 81 variables.

## 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ CreditGrade                        : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ClosedDate                         : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation                         : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus                   : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CurrentlyInGroup                   : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
##  $ GroupKey                           : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
##  $ DateCreditPulled                   : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable                   : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ LoanOriginationQuarter             : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
##  $ MemberKey                          : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...

What is/are the main feature(s) of interest in your dataset?

I intend to understand the reasons behind borrower delinquency and Prosper/ Investor response to credit lending practices.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

My focus is on the following features - “LoanOriginalAmount”,“ListingCategory..name”,“EmploymentStatus”, “IncomeRange”,“BorrowerAPR”,“LoanStatus”,“DebtToIncomeRatio”,“CreditScoreRangeLower”,“CreditScoreRangeUpper”,“LoanOriginationQuarter”, and “ProsperRating..Alpha.”

Did you create any new variables from existing variables in the dataset?

I did create two variables namely “ListingCategory_name” and “level_order”

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I did re-order Loan Origination Quarter to plot histogram.

Bivariate Plots Section

Note: https://www.prosper.com/plp/general-prosper_score/ APRs through Prosper range from 6.95% (AA) to 35.99% (HR) for first-time borrowers, with the lowest rates for the most creditworthy borrowers.

It looks like majority of loans given upto 2008-2009 does not show proper rating bute after “Q2 2009” most of the loans are of “AA”, “A” or “B” i.e. loan with low percentage.

After loan credit crises due to the housing bubble, minimum credit score was raised.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0   660.0   680.0   685.6   720.0   880.0     591

After loan credit crises due to the housing bubble, there is not only increase in minimum credit score but the implementation of the Prosper Rating system to help investors buy the loan with risk awareness.

Let’s also look into Prosper rating versus BorrowerAPR.

The plot shows that as rating go lower borrower’s APR increases. That’s normal. Outliers are removed.

summary(df$Investors)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00   44.00   80.48  115.00 1189.00

We see a trend of investors investing in higher loan dollars per Prosper rating system. Most of the investors are investing in Prosper rating “B” and “C” category - which appears driver purely by coupon yield.

Lets group borrower by income range and look for loan amount mean.

Higher the income range - higher the loan amount provided to the borrower. At the same time, we have to be careful about borrower’s debt to income ratio (below 33% is considered safe). We are plotting next

Lets look into level of order IncomeRange Versus Debt to income ratio

So the plot shows that lower income range has higher “DebtToIncomeRatio” compare to the higher income group thus lower-income group more risk to delinquency.

Comparing Prosper rating Vs. Credit score, original loan amount, and amount delinquent.

Most of the delinquent dollars are less than $1000 of the borrowed original $10000(mean) is mainly concentrated towards “HR”" Prosper rating and not so bad lower credit range.

Here is some correlation matrix of selected features

There is a strong correlation between “LoanOriginalAmount” and “MonthlyLoanPayment.”

## 
##  Pearson's product-moment correlation
## 
## data:  df$LoanOriginalAmount and df$MonthlyLoanPayment
## t = 867.82, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9312165 0.9327426
## sample estimates:
##       cor 
## 0.9319837

As the loan amount increases so is the monthly payment - that’s obvious.Corelation is quite strong.

## # A tibble: 2,468 x 5
##    LoanOriginalAmo~ mean_AmountDeli~ mean_BorrowerAPR mean_CreditScor~
##               <int>            <dbl>            <dbl>            <dbl>
##  1             1000               NA          NA                    NA
##  2             1001               NA           0.206                NA
##  3             1005               NA           0.340               530
##  4             1010             6683           0.0898              540
##  5             1025               NA           0.204                NA
##  6             1030               NA           0.271               600
##  7             1031             9720           0.236               690
##  8             1032             3207           0.263               560
##  9             1035               NA           0.188               500
## 10             1036               NA           0.0866              820
## # ... with 2,458 more rows, and 1 more variable: n <int>

## 
##  Pearson's product-moment correlation
## 
## data:  df.df_by_LoanOriginalAmount$LoanOriginalAmount and df.df_by_LoanOriginalAmount$mean_BorrowerAPR
## t = -18.579, df = 2451, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3855602 -0.3161646
## sample estimates:
##        cor 
## -0.3513449
## 
##  Pearson's product-moment correlation
## 
## data:  df.df_by_LoanOriginalAmount$LoanOriginalAmount and df.df_by_LoanOriginalAmount$mean_CreditScoreRangeLower
## t = 23.743, df = 2358, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4061005 0.4712489
## sample estimates:
##      cor 
## 0.439252

The above plot shows that there is a negative correlation(-0.3513449) between “Groupby(Loan Original Amount)” and “Mean Borrower APR.” As “Groupby(Loan Original Amount)” increases there is a decrease in “Mean Borrower APR.”

The above plot shows that there is a positive correlation(0.439252) between “Groupby(Loan Original Amount)” and “Mean Credit Score Range Lower.” As “Groupby(Loan Original Amount)” increases there is an increase in “Mean Credit Score Range Lower.”

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

After loan credit crises due to the housing bubble, there is not only increase in minimum credit score but the implementation of the Prosper Rating system to help investors buy the loan with risk awareness.

Better quality of loan after “Q2 2009”.

Investors investing in higher loan dollars for “AA”, “A” ratings.

Higher the income range higher loan amount. But we have to watch out for borrower’s debt to income ratio (preferably below 33%)

Lower income range borrower has higher “DebtToIncomeRatio” compare to the higher income group thus lower-income group more risk to delinquency. Prosper ratings are inversely proportional to borrower’s APR.

There is a negative correlation(-0.3513449) between “Groupby(Loan Original Amount)” and “Mean Borrower APR.” As “Groupby(Loan Original Amount)” increases there is a decrease in “Mean Borrower APR.”

There is a positive correlation(0.439252) between “Groupby(Loan Original Amount)” and “Mean Credit Score Range Lower.” As “Groupby(Loan Original Amount)” increases there is an increase in “Mean Credit Score Range Lower.”

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Most of the delinquent dollars are less than $1000 of the borrowed original $10000(mean) is mainly concentrated towards poor Prosper rating and not so bad lower credit range or upper credit range.

What was the strongest relationship you found?

There is a strong relationship between Prosper rating and Borrower’s APR.

Multivariate Plots Section

Most of the loan issued fall under 36-months Term. There are excess of loans with Prosper rating of “HR”, “E” and “D” and increasing APR. The quantity of loans debt issued for 12 months is quite less in comparison to 36 and 60 months. There appears to be a fair distribution of borrowers " Amount Delinquent " at the time credit was pulled, and it may be because most of the borrowers are trying to consolidate debt.

Most of the delinquent dollars between 200-700 have more than 200 days delinquent. Most of the delinquent data points are from Prosper rating “D”, “E”, and “HR” - which seems logical. We do see some Prosper rated loan “A” in delinquency - which raises doubt over faithfulness of the Prosper rating system.

Now looking into the relation between feature “Investors,” “LoanOriginalAmount,” and Prosper rating.Note: Entries where Prosper rating not listed are filtered.

## 
##  Pearson's product-moment correlation
## 
## data:  my_df_01$Investors and my_df_01$LoanOriginalAmount
## t = 98.125, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3131790 0.3252646
## sample estimates:
##       cor 
## 0.3192348

The above plot shows that there is a positive correlation(0.3192348) between “Investors,” and “LoanOriginalAmount.”

Additionally, we see that there is a gradual increase in “Investors” count for “LoanOriginalAmount” for higher rated “AA” and “A” borrowers. Poor rated have a lower count of “Investors”. Now, this makes sense and Prosper rating were implemented after the Real Estate bubble burst - “Q3 2009”.

The above plot shows that most of the loans were for 36 months and the estimated effective yield is higher as the Prosper rating decreases i.e., “C”, “D”, “E”, and “HR”. It appears that the lure of higher yield motivated the investors to invest in loans of the poor rating. This is more the reflection of time before the credit crisis where robot signing was the norm leading to world-wide recession.

We see trend of better quality,Prosper rated “AA”, loan after from year 2009. We do see some poor quality rated loan like “D” and “E” - which indicates investors affinity for higher yield. It would be interesting to see Prosper Dataset from the year 2014 onwards to know Investor sentiment towards yield when the federal interest rate is on the rise

Let’s look into the feature “Income Verifiable versus”Loan Original Amount."

It seems strange to know how Prosper rated borrowers if borrowers’ income were not verified. Maybe credit score was used. The plot shows that the number of loans given to borrowers where income was verified is more in comparison to not. Also, there is a gradual increase in the size of the loan amount as income is checked and after “Q2 2008” as Proper rating were implemented.

Linear model for predicting “BorrowerAPR”

lets create liner model m1 with response variable: BorrowerAPR and predictor variable: LoanOriginalAmount

Additionally updating: model m1 by m2 by adding variable “ProsperRating..Alpha.” , model m2 by m3 by adding variable “DebtToIncomeRatio”, model m3 by m4 by adding variable “Term” and in the linear regression

m1 = lm(BorrowerAPR ~ LoanOriginalAmount, data = df)
m2 = update(m1, ~. + ProsperRating..Alpha.)
m3 = update(m2, ~. + DebtToIncomeRatio)
m4 = update(m3, ~. + factor(Term))

mtable("Model 1" = m1,
       "Model 2" = m2,
       "Model 3" = m3,
       "Model 4" = m4)
## 
## Calls:
## Model 1: lm(formula = BorrowerAPR ~ LoanOriginalAmount, data = df)
## Model 2: lm(formula = BorrowerAPR ~ LoanOriginalAmount + ProsperRating..Alpha., 
##     data = df)
## Model 3: lm(formula = BorrowerAPR ~ LoanOriginalAmount + ProsperRating..Alpha. + 
##     DebtToIncomeRatio, data = df)
## Model 4: lm(formula = BorrowerAPR ~ LoanOriginalAmount + ProsperRating..Alpha. + 
##     DebtToIncomeRatio + factor(Term), data = df)
## 
## =============================================================================================
##                                 Model 1         Model 2         Model 3         Model 4      
## ---------------------------------------------------------------------------------------------
##   (Intercept)                      0.253***        0.201***        0.200***        0.196***  
##                                   (0.000)         (0.000)         (0.000)         (0.001)    
##   LoanOriginalAmount              -0.000***       -0.000***       -0.000***       -0.000***  
##                                   (0.000)         (0.000)         (0.000)         (0.000)    
##   ProsperRating..Alpha.: A                        -0.053***       -0.052***       -0.053***  
##                                                   (0.000)         (0.000)         (0.000)    
##   ProsperRating..Alpha.: AA                       -0.102***       -0.101***       -0.101***  
##                                                   (0.001)         (0.001)         (0.001)    
##   ProsperRating..Alpha.: B                        -0.008***       -0.007***       -0.008***  
##                                                   (0.000)         (0.000)         (0.000)    
##   ProsperRating..Alpha.: C                         0.033***        0.034***        0.033***  
##                                                   (0.000)         (0.000)         (0.000)    
##   ProsperRating..Alpha.: D                         0.085***        0.086***        0.084***  
##                                                   (0.000)         (0.000)         (0.000)    
##   ProsperRating..Alpha.: E                         0.133***        0.133***        0.132***  
##                                                   (0.001)         (0.001)         (0.001)    
##   ProsperRating..Alpha.: HR                        0.158***        0.159***        0.158***  
##                                                   (0.001)         (0.001)         (0.001)    
##   DebtToIncomeRatio                                                0.002***        0.002***  
##                                                                   (0.000)         (0.000)    
##   factor(Term): 36/12                                                              0.004***  
##                                                                                   (0.001)    
##   factor(Term): 60/12                                                              0.009***  
##                                                                                   (0.001)    
## ---------------------------------------------------------------------------------------------
##   R-squared                        0.104           0.716           0.707           0.707     
##   adj. R-squared                   0.104           0.716           0.707           0.707     
##   sigma                            0.076           0.043           0.043           0.043     
##   F                            13258.004       35811.700       28189.519       23117.848     
##   p                                0.000           0.000           0.000           0.000     
##   Log-likelihood              131830.816      197159.800      181848.123      181935.419     
##   Deviance                       658.982         209.283         195.437         195.113     
##   AIC                        -263655.632     -394299.600     -363674.247     -363844.838     
##   BIC                        -263626.702     -394203.168     -363569.030     -363720.492     
##   N                           113912          113912          105358          105358         
## =============================================================================================

From the model out we can interpret that with R-squared and adjusted R-squared equal 0.707 i.e 70.7 % and p-value = 0 it is a nice model.

Also we can see that slope turn from negative to positive as Proper rating goes shift from A towards HR, prediting higher Borrower APR.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

As “Groupby(Loan Original Amount)” increases there is a decrease in “Mean Borrower APR.”
As “Groupby(Loan Original Amount)” increases there is an increase in “Mean Credit Score Range Lower.”

Though small, it is visible that loan given with poor Prosper rating, i.e., “C”, “D”, “E”, and “HR” tend to have delinquencies(though small amount) as borrower APR increases.

Were there any interesting or surprising interactions between features?

It appears that the lure of higher yield motivated the investors to invest in loans of the poor rating. This is more to the reflection of time before the credit crisis where robot signing was the norm - leading to recession.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

I build a model with - response variable: BorrowerAPR and predictor variable: LoanOriginalAmount

Additionally updating: model m1 by m2 by adding variable “ProsperRating..Alpha.” , model m2 by m3 by adding variable “DebtToIncomeRatio”, model m3 by m4 by adding the variable “Term” and in the linear regression Though I got relatively high R-squared and adjusted R-squared value( equal to 0.707) - it does not necessarily indicate model is a good fit. We need to evaluate other model statistics to get a comprehensive insight.


Final Plots and Summary

Plot One

Description One

I have chosen this plot because it shows a vivid dip starting quater “Q2 2008”. Well, Bear Sterns collapsed in March 2008 and set the stage for worst debt crisis since the Great Depression. The Prosper dataset has reflections of those times - the crash of the real estate bubble and the collapse of commercial credit from the last recession.

Additinally,(https://www.sec.gov/litigation/admin/2008/33-8984.pdfhttps://www.sec.gov/litigation/admin/2008/33-8984.pdf) on November 24, 2008, the SEC found Prosper to be in violation of the Securities Act of 1933. As a result of these findings, the SEC imposed a cease and desist order on Prosper … In July 2009, Prosper reopened their website for lending (“investing”) and borrowing after having obtained SEC registration for its loans (“notes”). After the relaunch, bidding on loans was restricted to residents of 28 U.S. states and the District of Columbia. Borrowers may reside in any of 47 states, with residents of three states (Iowa, Maine, and North Dakota) not permitted to borrow through Prosper.

Resulting, government introduced regulations (https://en.wikipedia.org/wiki/Regulatory_responses_to_the_subprime_crisis) to better manage risk. Prosper introduced the rating system to help the investor invest prudently. Thus the plot show after “Q3 2009” a gradual increase in loan origination with better risk management in place. As time passes humans forget - regulations are rolled back(https://www.nytimes.com/2018/01/01/us/politics/trump-businesses-regulation-economic-growth.html) - and the stage is set for next recession.

Plot Two

## # A tibble: 2,468 x 6
##    LoanOriginalAmo~ mean_AmountDeli~ mean_BorrowerAPR mean_CreditScor~
##               <int>            <dbl>            <dbl>            <dbl>
##  1             1000               NA          NA                    NA
##  2             1001               NA           0.206                NA
##  3             1005               NA           0.340               530
##  4             1010             6683           0.0898              540
##  5             1025               NA           0.204                NA
##  6             1030               NA           0.271               600
##  7             1031             9720           0.236               690
##  8             1032             3207           0.263               560
##  9             1035               NA           0.188               500
## 10             1036               NA           0.0866              820
## # ... with 2,458 more rows, and 2 more variables:
## #   mean_DebtToIncomeRatio <dbl>, n <int>

## 
##  Pearson's product-moment correlation
## 
## data:  df.df_by_LoanOriginalAmount$LoanOriginalAmount and df.df_by_LoanOriginalAmount$mean_BorrowerAPR
## t = -18.579, df = 2451, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3855602 -0.3161646
## sample estimates:
##        cor 
## -0.3513449
## 
##  Pearson's product-moment correlation
## 
## data:  df.df_by_LoanOriginalAmount$LoanOriginalAmount and df.df_by_LoanOriginalAmount$mean_CreditScoreRangeLower
## t = 23.743, df = 2358, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4061005 0.4712489
## sample estimates:
##      cor 
## 0.439252
## 
##  Pearson's product-moment correlation
## 
## data:  df.df_by_LoanOriginalAmount$LoanOriginalAmount and df.df_by_LoanOriginalAmount$mean_DebtToIncomeRatio
## t = -0.60306, df = 2094, p-value = 0.5465
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.05596120  0.02965434
## sample estimates:
##         cor 
## -0.01317758

Description Two

The above plot shows that there is a negative correlation(-0.3513449) between “Groupby(Loan Original Amount)” and “Mean Borrower APR.” As “Groupby(Loan Original Amount)” increases there is a decrease in “Mean Borrower APR.”

There is a positive correlation(0.439252) between “Groupby(Loan Original Amount)” and “Mean Credit Score Range Lower.” As “Groupby(Loan Original Amount)” increases there is an increase in “Mean Credit Score Range Lower.”

There is a negative correlation(-0.01317758 ) between “Groupby(Loan Original Amount)” and “Mean mean_DebtToIncomeRatio.” - which is good trend. Borrowers with the higher range of lower credit score given lower ARP for increasing loan amount. This makes sense for Borrowers having less than 33%( as a general rule ). As the debt-to-income ratio increases, borrowers ability to pay monthly debt payments become difficult and may lead to delinquency. (Ref:https://www.consumerfinance.gov/ask-cfpb/what-is-a-debt-to-income-ratio-why-is-the-43-debt-to-income-ratio-important-en-1791/)

Plot Three

Description Three

Most of the delinquent dollars from $200-$700 have more than 200 days delinquencies. Most of the delinquent data points are from Prosper rating “D”, “E”, and “HR” - which seems logical. We do see some Prosper rated loan “A” in delinquency - which raises doubt over faithfulness of the Prosper rating system.
Not sure why 36-months Term is more popular than 12 months or 60 months - Maybe it is a sweet spot where ARP and monthly payment are just appropriate.


Reflection

My goal was to understand why debt delinquency happens and what can be done to reduce it. Prosper dataset has 113937 observations of 81 variables. I struggled in choosing the right features from this huge dataset, and it took a lot of time.

At times I used conditional means(library - dpyr ) for analysis and plotting.

I would prefer to analyze Proper dataset up to the year - 2018 to better understand how their rating system is helping investors to make the prudent debt investment. Prosper loans are unsecured - a high APR range ( 6.95% - 35.99% ) and poor Maximum debt-to-income ratio: 50% (excluding mortgage) is quite risky for the investor as well as borrowers(Ref: https://www.nerdwallet.com/blog/loans/prosper-personal-loans-review/)

As of August-2018, With low unemployment, 3.9% (ref:https://tradingeconomics.com/united-states/unemployment-rate) and decresing government regulations it will interesting see how Prosper has morphed its business model - latest dataset analysis can provide interesting insights.

Refrence:

https://stackoverflow.com/questions/1330989/rotating-and-spacing-axis-labels-in-ggplot2

https://en.wikipedia.org/wiki/Regulatory_responses_to_the_subprime_crisis

https://www.consumerfinance.gov/ask-cfpb/what-is-a-debt-to-income-ratio-why-is-the-43-debt-to-income-ratio-important-en-1791/

https://stackoverflow.com/questions/21801950/remove-legend-entries-for-some-factors-levels

https://stackoverflow.com/questions/22466479/correct-usage-of-scale-fill-manual-to-create-multi-colored-histogram-bars-in-g